Add flash attention and conv2d direct controls for image generation #1678
Conversation
Force-pushed 07d0e92 to 941e502
In what scenarios does it break? Is it model dependent, or purely backend dependent? Also, are there any drawbacks to it quality- or performance-wise? If it's an all-round improvement, rather than overcomplicate things with too many flags, would it be better to just force-enable it for supported backends? Or is it conditionally broken?
It depends directly on the implementation of a specific operation, so crashes are unlikely to be model-specific (except in the sense that a model family which doesn't use that operation would of course always work, but I think all of them currently do). In terms of performance, it's a bit too early to be sure: when it helps, it helps a lot, but it's not clear whether it could get worse for some backend+model+hardware combinations. Quality seems to be the same, although outputs may vary a bit; in my tests, the VAE output changes less than it does with tiling. Another possibility would be enabling the flags unconditionally for known-to-be-working cases, but including an environment variable to work around possible issues (either forcing it on to test a new backend, or off to avoid a special case).
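As a rough sketch of that fallback idea (the environment variable name, helper function, and known-good list below are hypothetical illustrations, not part of this PR):

```python
import os

def conv_direct_default(backend: str) -> bool:
    """Hypothetical helper: enable conv2d direct for known-good backends,
    with an environment variable override for testing or troubleshooting."""
    override = os.environ.get("SD_CONV_DIRECT")  # hypothetical name; "1" forces on, "0" forces off
    if override is not None:
        return override == "1"
    known_good = {"cpu", "vulkan"}  # backends reported working so far in this PR
    return backend.lower() in known_good
```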
Already found an example - my own card! VAE timings for rendering a 512x512 SDXL image, radv versus amdvlk:
It's even more dramatic for diffusion-conv-direct: ~20% worse on radv, 5x worse on amdvlk.
Force-pushed 1977086 to c6f7603
Alright, how about this:
Force-pushed 941e502 to 15a4f9f
A single boolean option would be a bit annoying, to be honest :-) My card gets 20% slower for the diffusion step, so it could be slower overall for high step counts, even if the VAE is much faster. But a non-boolean sdconvdirect would be fine. I've implemented one with three values: disabled, enabled, and 'vaeonly' to enable it only for the VAE (I don't think we'll find a case where only the VAE would be slower, but it's easy to add an extra value for that if you wish).
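For illustration, a minimal argparse sketch of such a tri-state option (the exact option name, value strings, and default here are assumptions, not necessarily what the PR implements):

```python
import argparse

parser = argparse.ArgumentParser()
# Tri-state control: off everywhere, on everywhere, or on for the VAE only.
parser.add_argument("--sdconvdirect", choices=["disabled", "enabled", "vaeonly"],
                    default="disabled",
                    help="Use conv2d direct for diffusion and VAE, for the VAE only, or not at all.")

args = parser.parse_args(["--sdconvdirect", "vaeonly"])
vae_conv_direct = args.sdconvdirect in ("enabled", "vaeonly")
diffusion_conv_direct = args.sdconvdirect == "enabled"
```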
Force-pushed 15a4f9f to d485089
Anyway, let me know when it's ready to review.
Force-pushed d485089 to e69878c
Should be good to go now... except maybe for the combobox control. I tried to follow the other labeled controls as examples; it works, but you'll probably want to adjust its position and padding. It's only called from one place right now, but I may make use of it in another (unrelated) PR.
You added
Force-pushed 47326af to df2a43f
Cleaned up that var, and rebased on top of
Edit: while working on #1692, I started to think
I think the disambiguation is good. So long as the commands are clear, it's good. Generally I try to avoid enabled/on/off/disabled flags because that is implied with store_true, i.e.
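For illustration (the commenter's own example was cut off above; the flag name here is hypothetical), an argparse store_true flag is simply present or absent, so no enabled/disabled value is needed:

```python
import argparse

parser = argparse.ArgumentParser()
# Passing the flag turns the feature on; omitting it leaves it off.
parser.add_argument("--sdflashattention", action="store_true",
                    help="Enable flash attention for image generation.")

print(parser.parse_args(["--sdflashattention"]).sdflashattention)  # True
print(parser.parse_args([]).sdflashattention)                      # False
```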
Force-pushed df2a43f to ab2f092
should be good, merging
This is a straightforward implementation that adds controls and command-line switches for the flash_attention, diffusion_conv_direct, and vae_conv_direct flags.
For Conv2D Direct, see leejet/stable-diffusion.cpp#744 for details. In a nutshell: VAE decoding can be ~2.4-3.8x faster, with less than half the memory usage (allowing 1024x1024 generations without VAE tiling). The diffusion process currently doesn't seem to benefit from it, so the main reason to enable the control would be to expose the functionality to more testing.
The main drawback is a possible crash on unsupported backends. We could in principle only honor the flag for known-to-be-working configurations, or blacklist broken ones. In my tests, it currently works on CPU and Vulkan, while ROCm crashes.
The separate flash attention control is useful to avoid certain bugs, like leejet/stable-diffusion.cpp#756, and configurations where flash attention benefits text generation but slows down image generation.
Depends on #1669.